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Abstract 

Background: High throughput 'omics' experiments are usually designed to compare changes observed between 
different conditions (or interventions) and to identify biomarkers capable of characterizing each condition. We 
consider the complex structure of repeated measurements from different assays where different conditions are 
applied on the same subjects. 

Results: We propose a two-step analysis combining a multilevel approach and a multivariate approach to reveal 
separately the effects of conditions within subjects from the biological variation between subjects. The approach is 
extended to two-factor designs and to the integration of two matched data sets. It allows internal variable selection to 
highlight genes able to discriminate the net condition effect within subjects. A simulation study was performed to 
demonstrate the good performance of the multilevel multivariate approach compared to a classical multivariate 
method. The multilevel multivariate approach outperformed the classical multivariate approach with respect to the 
classification error rate and the selection of relevant genes. The approach was applied to an HIV-vaccine trial 
evaluating the response with gene expression and cytokine secretion. The discriminant multilevel analysis selected a 
relevant subset of genes while the integrative multilevel analysis highlighted clusters of genes and cytokines that 
were highly correlated across the samples. 

Conclusions: Our combined multilevel multivariate approach may help in finding signatures of vaccine effect and 
allows for a better understanding of immunological mechanisms activated by the intervention. The integrative 
analysis revealed clusters of genes, that were associated with cytokine secretion. These clusters can be seen as gene 
signatures to predict future cytokine response. The approach is implemented in the R package mixOmics 
(http://cran.r-project.org/) with associated tutorials to perform the analysis 9 . 



Background 

Recent advances in high throughput omics' technolo- 
gies enable quantitative measurements of expression or 
abundance of biological molecules in a whole biological 
system. Various popular omics platforms in systems biol- 
ogy include transcriptomics, proteomics, cytomics and 
metabolomics. These experiments are usually designed to 
compare changes observed between different conditions 
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or groups and are often used to identify biomarkers capa- 
ble of characterising pathological states or response to 
treatment. 

The decreasing costs of these high-throughput plat- 
forms now enable repeated measures experiments on the 
same individuals or biological samples. Such experiments 
allow a substantial gain in information. For instance, lon- 
gitudinal designs are more powerful as they reduce the 
noise due to inter individual variability, as long as the 
correlation between repeated observations is taken into 
account. There exists an abundant literature on the anal- 
ysis of repeated measurements of omics data [1,2]. In 
this context, a common approach is to apply a univariate 
mixed model on each gene followed by multiple test- 
ing correction [3]. However, this approach disregards the 
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dependency between genes, and due to the high dimen- 
sionality of the data, numerous hypotheses tests must be 
performed. 

The mixed model approach has been used for the analy- 
sis of one single data type (e.g. gene expression). However, 
a growing number of high-throughput data are gener- 
ated in standard clinical trials. For example, the evaluation 
of HIV vaccine in phase I/II trials incorporates mea- 
surements of counts of numerous types of cell, of the 
production of intra and extracellular cytokines and of gene 
expression [4]. The integration of such multi-layer infor- 
mation can help unravel the complexities of a biological 
system, as each functional level is hypothesized to be 
related to each other [5] . However the integration of omics 
data is a challenging task. Firstly, the large number of mea- 
sured biological entities makes it very difficult to obtain 
a good overview or understanding of the system under 
study. Secondly, the small number of samples or patients 
makes statistical inference difficult and argue for using the 
maximum amount of available information. Thirdly, the 
integration of heterogeneous data represents an analyti- 
cal and numerical challenge when trying to find common 
patterns in data from different origins. 

In recent years, several multivariate approaches have 
been proposed to combine two omics data, often in 
an unsupervised framework. In contrast to univariate 
repeated measures analysis, these linear multivariate 
approaches take into account the dependency between 
genes, are able to handle large and noisy data sets and do 
not face computational issues in the high dimensional case 
as matrix inversions are avoided. Most importantly in the 
context of this study, they enable the integration of data 
coming from different platforms and provide interpretable 
visualisation tools. These approaches aim at selecting cor- 
related biological entities from two [6-11] or more data 
sets [12]. In particular, with sparse Partial Least Squares 
(sPLS) we have shown that the integrative analysis of 
large scale omics datasets could generate new knowledge 
not accessible by the analysis of a single data type alone 
[7,8]. The biological relevance of this approach has been 
illustrated recently in some studies [13,14]. 

The flexibility and versatility of PLS also enable a 
supervised framework through PLS -Discriminant Analy- 
sis (PLS -DA [15]). A variant of which has recently been 
proposed to select discriminative features that best sep- 
arate the different conditions (sPLS-DA, [16]). sPLS-DA 
was shown to give similar performances to classical clas- 
sification methods such as Machine Learning approaches 
and variants of Linear Discriminant Analysis and was 
recently applied in a biological study [17]. 

In this paper, we consider a two-step approach to model 
the correlation between repeated measurements while 
taking advantage of the multivariate approaches. We first 
propose to extract the within-sample variation [18-20] 



before analysing this transformed data set using sPLS- 
DA for a discriminant analysis or sPLS for an integrative 
analysis. 

Starting from the classical mixed-model, we present the 
principle of a multilevel analysis to extract the within- 
sample deviation of the data and we extend the approach 
to a two-factor analysis. The within data set is then anal- 
ysed with either sPLS-DA to select discriminative genes 
between the groups of subjects on a single data set, or 
with sPLS to select subsets of correlated variables from 
two data sets. A simulation study is performed which 
demonstrates the good performance of multilevel sPLS- 
DA compared to a classical sPLS-DA. The approach is 
then illustrated on an HIV vaccination study, where the 
effect of a lipopeptide based vaccine was explored by mea- 
suring before and after vaccination various components 
of the immune response, including gene expression and 
cytokine secretion. These repeated measurement were 
made in several in vitro conditions on Peripheral Blood 
Mononuclear Cells: 'NS' (no stimulation); HIV Gag pep- 
tides 'GAG+' (peptides included in the vaccine), HIV Gag 
peptides 'GAG-' (peptides not included in the vaccine) 
and 'LIP05' (all five peptides included in the vaccine). 

Methods 

Notations 

Let X(N x p) and Z(N x q) represent two data matrices 
(e.g. gene expression and cytokine secretion). We denote 
by N the total number of samples (or rows) in the data, 
and by n the number of experimental units (or unique 
subjects), p (resp. q) is the total number of genes (resp. 
cytokines), also called variables or predictors. The dummy 
matrix Y(N x G) indicates the group/treatment of the 
samples, with G the total number of groups. 

Multilevel approach 

We first present the mixed-effect model as a pedagogi- 
cal tool and then introduce the concept of the multilevel 
approach based on the "split-up" variation. Despite the 
fact that some similarities exist between the mixed-effect 
model and the multilevel "split- up" variation approach, we 
emphasize that the latter is performed completely inde- 
pendently from the estimation of the mixed-effect model. 
Moreover, the mixed model relies on certain assumptions 
(such as Gaussian distribution of random effects) that the 
split-up variation approach does not require. 

Mixed-effect model 

Let J(A be the gene expression of a given gene k for subject 
s with stimulation In this context, the mixed model is 
defined by: 

X k sj = + n k +6*, s = 1, . . . , n, j = 1, . . . , G s 

= // + af + 7T* + € k sj 
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where for a given gene /<, \i- measures the fixed effect 
of stimulation which can be further decomposed into 
//A, the overall mean stimulation effect, plus ot k which is 

the differential effect for stimulation /. The n k are inde- 
pendent random variables following a normal distribu- 
tion A/*(0, cr£ k ), which take into account the dependency 
between the repeated measures made on the same subject 
s, the residuals e k j are independent random variables fol- 
lowing a A/"(0, a? ) distribution. Note that the number of 
stimulation for each subject (G s ) may differ. However, for 
each subject s we observe no more than one observation 
for each stimulation thus the subject effect interac- 
tions with the stimulation factor are confounded with the 
residuals. We also assume that n k and e k - are independent. 

This model is known also as the one-way unbalanced 
random-effects ANOVA. A simple approach for identi- 
fying differentially expressed (DE) genes in this model is 
to test the stimulation effect for each gene and apply a 
multiple testing correction (FDR from [21] set to 5%). Fol- 
lowing this global test, pairwise comparison can then be 
applied between two stimulations, followed by multiple 
correction (e.g. FDR, 5%). Some limitations of this stan- 
dard approach are discussed in the Results and discussion 
Section. The main advantage of the mixed model is the 
introduction of the random element n k , which is specific 
to the subject s and represents the between-subject devi- 
ation. In the same spirit, the multilevel approach based 
on the split-up variation focusses on separating the dif- 
ferent sources of variation: the within- subject deviation 
("variation") and the between-subject deviation. 

Split-up variation 

As suggested by Westerhuis et al. [19] in the mixed model 
framework, the observation x k - can be decomposed into: 

4 = ^ + + (4 - *£) (1) 

offset between-subject deviation within-subject deviation 

where x k = ^ E E 4 and = h ? 4* The ofF " 

7=1 s=l 5 7=1 

set term x k is an estimation of //A, the between-subject 
deviation is an estimation of n k and the within-subject 
deviation is an estimation of a k + € k - and can be further 
decomposed as: 

(4 - 4) = (4 ~ + ( 4 ~ x s- ~ 4 + ^ 

" v ' ^ v ' ^ v ' 

within-subject deviation Stimulation effect residual 

where x k - = -7- V^-i x ! ct> with m the number of sub- 

•J Ylj ' *o — J- oj J 

ject undergoing stimulation Therefore, a part of the 



within-subject deviation is explained by the stimulation 
effect 

Let X be the (N x p) gene expression matrix on s = 
l,...,n subjects with G s stimulations (in the balanced 
case N = n x G, otherwise N = Yls=i G s )- According to 
equation (1): 

X — + Xjy^ + Xy^ 

offset term between-subject deviation within-subject deviation 

The matrix X.. represents the offset term defined as 
lN#. r > where In is the (N x 1) matrix containing ones and 
% T = (x} m , . . . , x?.); Xy is the between-subject matrix of size 
(N x p) denned by concatenating 1g s ^I s for each subject 
into Xy with x£ s = (x\. — x\> . . . , x^. — x?.); X w = X — X s . 
is the within-subject matrix of size (N x p), with X s . the 
matrix defined by concatenating the matrices 1g s *J for 
each subject into X s ., with xj. = (x\., ...,*%.). 

Similarly to the Analysis of Variance, it is easy to show 
that the sum of squares can be separated into three parts: 

||X|| 2 = ||X.|| 2 + ||X*|| 2 + ||XJ| 2 , (2) 

where ||X|| 2 = trace{X T X). Equation (2) can be used 
to evaluate the magnitude of the different sources of 
variation. 

The mixed-model described earlier can provide an anal- 
ysis for repeated measurements data in an unbalanced 
design. It can be viewed as an extension of a paired t-test 
to test the differences between paired observations. How- 
ever, to tackle some of the previously mentioned limita- 
tions of the approach, we propose to combine a multilevel 
approach and a multivariate approach as an interesting 
alternative. Indeed, the multilevel step splits the differ- 
ent parts of the variation while taking into account the 
repeated measurements on each subject. Since the stim- 
ulation effect from each subject can be separated from 
the between subject deviation (variation), it is possible to 
examine the differences in stimulation effect within the 
subjects in a much easier way than without the separation 
of the difference sources of variation [19]. Westerhuis et. 
al (2010) provided the rationale and showed the benefit 
of the multilevel approach in the analysis of multivari- 
ate paired (cross-over) data. In this paper where we aim 
at identifying genes discriminating the different stimula- 
tions, we propose to apply a multivariate approach on the 
within matrix X w which includes the stimulation effect, 
in the same spirit as in [18-20]. This approach is more 
powerful, as it takes into account not only the depen- 
dency between genes via the multivariate approach, but 
also the repeated measures between individuals and the 
stimulation effects via X w . 
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Extended method for two factors 

We propose to extend this approach for data with two 
factors: the time ('before' and after' vaccination), in addi- 
tion to the stimulation factor. Let X k - t be the expres- 
sion of a given gene k for subject s with stimulation 
; at time t = 1,2. In this context, the mixed model 
is defined as: 

' X sjt = 4 + + (<X*)sj + IP*)* + 4' 

^ k t = ^+a k + fi k + (afi) k t , 

where for a given gene /<, n k is the gene population mean 
(offset term); a k measures the fixed effect of stimula- 
tion fi k measures the fixed effect of time t; (otfi) k t is 
the interaction effect between the stimulation ; and the 
time t; n k ~ N(0,cr% ) is the random subject effect; 

(ot7r) k s - ~ N{0,<7an k ) measures the random interac- 
tion effect between the subject s and the stimulation ;; 
(fi7t) k t ~ N(0, &p nk ) measures the random interaction 
effect between the subject s and the time t; the residu- 
als € s j t ~ N(O f cr^ k ) and the variables 7t k , e k - p {pin) k - and 

(fi7t) k t are assumed to be independent. In the context of 
our application, the potential subject interactions effect 
with the stimulation and the time effect are confounded 
with the residuals terms since only one observation is 
available per subject for each level of both time and 
stimulation factors. 
According to the mixed model, we have: 

X<*j£ — X mmm I (pC g m m OC m m m ) 

offset term between-subject deviation 

( x sjt ~ X s -) 
within-subject deviation 

where the within-subject deviation can be further decom- 
posed as: 

(Xsjt ~ X l) = (4j.-X k .) + {x k t - X k .) 

Stimulation effect Time effect 
+ 

interaction effect 

+ (x k sJ .-x k . -x k ..+x k ) 

random inter: subject x Stimulation 

+ (*w-a£t -**..+*£.) 

v. ■> 

random inter: subject x Time 

+ (4j t - *§» - 4- - + 4 + £t + **■ _ ^ 

Residual 



The matrix representation gives: 




within-subject deviation 

= ^Stimulation H~ ^Time H~ ^Stimulation x Time H" ^Residual 



~l~ ^subject x Stimulation H~ -^subjectxTime 
random interaction 

Similar to the one-factor decomposition, the multivari- 
ate approach will be applied on the within matrix X w *, 
which includes stimulation, time and interaction effects. 

Discriminant analysis of one data set 

Once the multilevel approach has been applied to split 
up the variation in the data, a variant of the multivariate 
approach PLS Discriminant Analysis (called sparse PLS- 
DA) is applied on the within matrix X w or X w * in order to 
select discriminative genes between the groups of subjects 
on a single data set. 

Sparse PLS-DA 

Linear Discriminant Analysis (LDA) and Partial Least 
Squares Discriminant Analysis (PLS-DA, [15]) are 
exploratory approaches seeking the optimal linear com- 
binations of variables (genes) which best separate the 
sample groups. PLS-DA has been found to be a promising 
alternative to LDA since the latter faces numerical limita- 
tions when dealing with too many correlated predictors. 
Let X(N x p) be the within predictor matrix (to improve 
readability, the subscript w is removed) and Y(N x G) 
the response dummy matrix indicating the group of 
each sample. In PLS-DA, X is column standardized. The 
PLS-DA objective function to solve can be written as [15]: 

max cor (Y,Xu)var(Xu), (3) 

u 

where we denote by § = Xu the discriminant direc- 
tion vector, which is a linear combination of the original 
variables. The vector u is the associated loading vector 
indicating the weights of each variable in the linear com- 
bination §. Once step (3) has been performed and the first 
weight vector u\ has been extracted, both matrices Y and 
X are deflated such that the following loading vector u<i is 
orthogonal to the previous one. PLS-DA therefore outputs 
a set of loading weight vectors u\, U2, . . . , uh and associ- 
ated discriminant direction vectors § 2 > • • • > %h> where 
H is the number of PLS-DA dimensions (or deflations). 

The sparse version proposed by Le Cao et al. [16] uses 
the Lagrange form of PLS-DA to include a L\ constraint 
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on u in order to ensure that some Uj will be estimated 
as exactly zero (j = 1, . . . ,p). Thus, these corresponding 
variables will not contribute to the discriminant direction. 
sPLS-DA therefore allows variable selection for choosing 
the variables that best discriminate/separate the sample 
groups. 

Parameters tuning 

Two parameters need to be tuned in sPLS-DA: the num- 
ber of discriminant vectors H and the number of variables 
to select on each dimension (PLS component). Le Cao 
et al. [16] showed that for most cases, the user could 
effectively set H = G — 1. The number of variables to 
select is, however, a challenging issue as tuning criteria 
are often limited by the very small number of samples. In 
this study, we considered two criteria to guide the choice 
of this parameter, both of them are applied sequentially, 
dimension per dimension. 

Tuning criterion 1. One option is to use cross-validation 
to choose the optimal number of selected variables in 
order to avoid selection bias [22]. After this step, the full 
data are analyzed given this tuned parameter. We pro- 
pose to estimate the generalization error rate using k 
fold cross-validation. In the case of a very small sample 
size (< 15 subjects), the leave-one-out cross-validation 
(denoted "loo") can be used instead. In the specific con- 
text of repeated measures and in order to respect the data 
structure, the training set is composed of the measure- 
ments on all experimental units except the measurements 
on one subject s which defines the test set denoted X^. 
The test set prediction is denned by Y test = X^JP, where 
P is the regression coefficient matrix from sPLS-DA (see 
[16] for more details). The process is repeated for each 
subject and the classification error rate is averaged across 
all subjects. This process is tested for each number of 
variables to select (see Additional file 1: Figure S3 and 
Figure S6), and the "optimal" number of variables is then 
determined when the lowest error rate is obtained. 

Tuning criterion 2. In the case where the number of sub- 
jects is too small, an ad-hoc alternative approach was used 
on the whole data set by computing cor(Y,Xu)var(Xu) 
for each deflated matrix and with respect to the number 
of selected variables. This is similar to that proposed by 
Waaijenborg et al. [10] and Parkhomenko et al. [9]. The 
number of variables selected is chosen to maximize the 
criterion value. 

Integrative analysis of two data sets 

Similarly to the PLS -DA analysis, a more general PLS mul- 
tivariate approach can be applied on the matching within 
matrices X w (or X w *) and Z w (or Z w *). For this analysis 
however, the aim is to integrate two data sets in a non- 
supervised manner and select correlated variables from 
both data sets across the subjects. 



Sparse PLS 

Partial Least Square regression (PLS, [23]) is in fact the 
ancestor of PLS-DA and is applied in a non supervised 
context, where X(N x p) and Z(N x q) are two contin- 
uous within matrices of two different types of predictors 
{e.g. gene expression and cytokine secretion). In PLS, both 
X and Z are column standardized. To improve readabil- 
ity, the subscript w is removed from both these matrices. 
PLS relates X and Z by a linear multivariate model, while 
also modelling the structure of X and Z. PLS is particu- 
larly useful for analysing noisy, collinear, even incomplete, 
high dimensional data, see [24] for a review. 
PLS performs successive decompositions of X and Z into 
new variables (component scores) denoted by (£]_,..., % H ) 
for the X-scores and (o>i, . . . , (Oh) for the Z-scores. These 
scores should be few in number (H small), orthogonal 
to each other within each data set, and estimated as lin- 
ear combinations of the original variables from X and Z 
with their weights coefficients indicated in the associated 
loading vectors and (h = 1, . . . ,H) respectively. In 
matrix representation, we have X = SC r + £, Z — 
&D T + F, where E and F are the residual matrices, and 
the column matrices in C and D are the coefficients from 
the local regressions of the score vectors § h (co^) onto the 
current deflated matrices denned as X^ = X^_i — % h c' h 
and Z h = Z h -i - (*d f h , where c h = X\_£ h l%£ h and 
dh = Yh-iVh/v'hVh- 

PLS relates both matrices by maximising the covariance 
between each pair of scores The PLS objective 

function is: 

arg max cov{Xp l up l ,Zp l vj 1 ) h = l...H. (4) 

11^11=1,11^11=1 

This PLS form is often referred to as "PLS2 mode A" in 
the literature [25] where, similar to Canonical Correlation 
Analysis, the aim is to model a 'bidirectional' relation- 
ship between the two data sets (to maximise the common 
information between the two data sets), as opposed to 
a unidirectional' relationship when using a regression 
model. The sparse version, sPLS, enables variable selec- 
tion from both sets by including L\ penalizations on both 
Uh and v/z simultaneously in (4), which is solved with 
a Lagrangian form (see [7,8] for more details about the 
methodology and the algorithm). The result is a subset of 
correlated variables from both X and Z indicated in the 
loading vectors (uh,Vh) for each PLS dimension h, and a 
set of score vectors ($^,(0^) that are useful for graphical 
representations. 

Parameter tuning 

As an extension to the tuning criterion 2 from the previous 
section, and similar to what was proposed by Waaijenborg 
et al. [10] and Parkhomenko et al. [9], the number of PLS 
components and number of variables to select in each 



Liquet etal. BMC Bioinformatics 201 2, 13:325 
http://www.biomedcentral.eom/1 471 -21 05/1 3/325 



Page 6 of 14 



step can be tuned by computing coviX^u^Zj^Vh), which 
is the criterion maximized in sPLS2 mode A for each PLS 
dimension h (see equation (4)). For an optimal number of 
selected variables from both datasets, one would expect 
this criterion to achieve also a maximum. 

Results and discussion 

We first present the results of a short simulation study 
to show the importance of using a multilevel approach 
in comparison to a standard sparse partial least square 
analysis on the original data. We then apply the proposed 
multilevel approach on an HIV- vaccination study. 

Simulation study 
Simulated model 

A simulation study based on the following mixed effects 
model was performed: 

4 = ^ + 7^ + 6*, 5 = 1,. ..,12,; = 1,. ..,4, 

with n k ~ N(0,a£ k ), e k ~ N(0,a* k ), where n k and 

e k j are independent. From this model, 10 clusters of 100 
genes each were generated (k = 1, . . . , 1000). For any 
given pair of genes k\ and I<2 in the same cluster, a pair- 
wise correlation for X k f and X k J is specified by assuming 

coring 1 ,7ts 2 ) = p and cor '(ef- 1 , ef?) = p, while genes 
belonging to different clusters are taken to be uncorre- 
cted. The random variables n k and € k - from same cluster 
are generated from the multivariate normal distribution 
(n k \ . . . ,7r s * 100 ) ~ Wioo(Oioo, ^jt) where the variance- 
covariance matrix is a (100 x 100) matrix with cr% 
along the diagonal and pcr% for the others terms; and from 

the multivariate normal distribution (e k K . . . , 6^ 100 ) ~ 

•v -v 

A/ioo(Oioo> ^e) where the variance-covariance matrix E 6 is 
a (100 x 100) matrix with along the diagonal and pcr^ 
for the others terms. 

To mimic the application, clusters of genes discrimi- 
nating 4 conditions were generated (the 4 stimulations 
denoted LIP05, GAG+, GAG- and NS) , where the 
mean effect of each stimulation is specified by fi k = 
(/Xp /ji k , ii k , /x^) r , according to the following: 

• 2 gene clusters discriminate (LIP05, GAG+) versus 
(GAG-, NS) with fi k = (4, 4, 0, 0) r and 

fi k = (3,3,0,0) r . 

• 2 gene clusters discriminate LIP05 versus GAG+, 
while GAG+ and NS have the same effect: 

fi k = (5, 2, 0.2, 0.2) T and fi k = (5, 2, 0, 0) r . 

• 2 gene clusters discriminate GAG- versus NS, while 
LIP05 and GAG+ have the same effect: 

fi k = c(l, 1, 5, 2) T and fi k = c(0, 0, 5, 2) T . 

• the 4 remaining clusters represent noisy signal (no 
stimulation effect): fi k = c(0, 0, 0, 0) r and 

fi k = (0.5, 0.5, 0.5, 0.5) T . 



The intra cluster correlation was either set to p = 0.7 or 
0.8. Different values for cr% and a^ k were studied, but for 
the sake of conciseness the results are only presented for 

a n/< = 2 and a €k = 0.5. 

Numerical results 

From the simulated data, the within matrix was computed 
and applied to multilevel sPLS-DA. Figure 1 displays the 
sample representation for the first 3 axes or dimensions 
for one simulation run. 

Firstly, in order to highlight the benefit of the multi- 
level approach in comparison to the multivariate approach 
without the split-up variation step, a prespecified number 
of genes was selected on each dimension in order to assess 
the ability of each approach to select the true relevant 
genes. As expected, 3 components (linear combinations of 
200 genes) were sufficient to discriminate the effect of the 
4 stimulations. Multilevel sPLS-DA (applied on the within 
matrix) selected 92% of the true simulated discriminative 
genes as compared to 75% of the true discriminative genes 
for classical sPLS-DA (applied on the original matrix), see 
Table 1. The hierarchical clustering of the genes selected 
by sPLS-DA on the within matrix (Figure 2) confirmed 
the discriminatory ability of these genes to separate the 4 
groups of samples. As expected, a group of 6 gene clusters 
can be observed. On the contrary, we did not observe such 
clusters when applying sPLS-DA on the original matrix 
(not shown). 

Secondly, leave-one-out cross-validation was performed 
on each simulation run to evaluate the error rate of clas- 
sification of classical sPLS-DA or multilevel sPLS-DA 
(Table 2). The classification error rate was evaluated for 
different number of genes selected on each component. 
For example, the average classification error rate for multi- 
level sPLS-DA was of 0.009 for 200 genes selected on each 
of the 3 axes compared to an error rate of 0.268 for the 
same parameters with classical sPLS-DA. 

Application to HIV vaccine evaluation 
Description of the study 

The data come from a trial evaluating a vaccine based 
on HIV-1 lipopeptides in HIV-negative volunteers [26]. 
The vaccine (HIV-1 LIPO-5 ANRS vaccine) contains five 
HIV-1 amino acid sequences coding for Gag, Pol and 
Nef proteins. A subsample of 12 vaccinated participants 
was randomly selected and experiments were performed 
before and after vaccination. The data consist of moni- 
tored cytokine secretion and gene expression measure- 
ments from purified in vitro stimulated Peripheral Blood 
Mononuclear Cells (PBMC). Cytokine secretion was anal- 
ysed by cytokine multiplex (millipore) in the supernatant 
of PBMC after 11 -day-culture, and the data set consists 
of 10 cytokines measurements (IFNy, IL1/3, IL2, IL5, IL6, 
IL10, IL13, IL17, IL21, and TNFa). Gene expression was 
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analysed using the Illumina HumanHT-12 v4 Expression 
BeadChip on PBMC before (WO) and 14 weeks after vac- 
cination (W14), 6 hours after in vitro stimulation by either 
(1) all the peptides included in the vaccine (LIPO-5), or (2) 
the Gag peptides included in the vaccine (GAG+) or (3) 
the Gag peptides not included in the vaccine (GAG-) or 
(4) without any stimulation (NS). 

Preprocessing 

Background correction, log2 transformation and quan- 
tile normalisation were applied on the gene expression 
data using the R limma package. Probes were further 
prefiltered for each time point (before and after vaccina- 
tion) using a P-value detection (<1% in all samples). The 
preprocessed data set contained the expression of 25,109 



probes for 12 subjects for 4 types of stimulation before 
vaccination (WO) and the expression of 24,687 probes 
after vaccination (W14). Some samples were not available 
due to DNA quality issues, resulting in 44 samples at WO 
and 42 samples at W14. For the multilevel approach with 
two factors, the analysis was performed on the common 
prefiltered probes before and after vaccination (21,350 
probes in total). 

The statistical analysis was performed on the probe 
expression, but the results were biologically interpreted at 
the gene level. 

Discriminant analysis on the transcriptomics data 

First we present results obtained using a mixed model and 
discuss some potential limitations of this method in the 



Table 1 Simulation study 





Component 1 


Component 2 


Component 3 


All 


classical sPLS-DA 


58.0 


75.0 


87.2 


78.2 


multilevel sPLS-DA 


82.8 


95.6 


93.1 


92.0 



Percentage of the number of true selected genes selected by classical sPLS-DA or multilevel sPLS-DA on each component or dimension (averaged over 1 00 simulation 
runs); 200 genes were selected on each component. 
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Figure 2 Stimulation study. Hierarchical clustering (Euclidian distance and Ward method aggregation) of the genes selected with multilevel 
sPLS-DA. Samples are represented in columns and genes in rows. 



context of small sample size. Then we present the results 
obtained using multilevel sPLS-DA for one and two-factor 
analyses. To shorten the length of the paper, some results 
have been moved in Additional file 1. The R code used for 
the analysis of this study is provided in Additional file 2. 

Mixed model 

The one-level mixed model was applied to the W14 tran- 
scriptomics data. We used the mle function from the R 
package nlme with the maximum likelihood method for 



the estimation of the different models. A global test (like- 
lihood ratio test) followed by an FDR multiple correction 
(5%) identified 2308 DE genes in at least one of the stim- 
ulation. Pairwise comparisons based on Wald test (FDR = 
5%) were then performed to compare LIP05 vs. NS (2108 
DE genes), GAG+ vs. NS (1087 DE genes) and GAG-vs. 
NS (209 DE genes). The summary of the results is avail- 
able in Section 1 of the Additional file 1. In our case study, 
the clustering analysis of the 100 most significant differen- 
tially expressed genes selected by the mixed model failed 
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Table 2 Simulation study 



Number of 
genes 




Original matrix 






Within matrix 




1 component 


2 components 


3 components 


1 component 


2 components 


3 components 


25 


0.535 


0.369 


0.312 


0.500 


0.271 


0.024 


50 


0.530 


0.364 


0.311 


0.500 


0.265 


0.016 


75 


0.527 


0.360 


0.306 


0.500 


0.261 


0.013 


100 


0.524 


0.354 


0.300 


0.500 


0.258 


0.011 


125 


0.522 


0.351 


0.296 


0.500 


0.257 


0.009 


150 


0.520 


0.343 


0.285 


0.500 


0.250 


0.008 


175 


0.518 


0.335 


0.281 


0.500 


0.243 


0.009 


200 


0.516 


0.327 


0.268 


0.500 


0.234 


0.009 


225 


0.514 


0.323 


0.269 


0.500 


0.227 


0.009 


250 


0.512 


0.316 


0.267 


0.500 


0.220 


0.008 


275 


0.510 


0.314 


0.266 


0.500 


0.207 


0.007 


300 


0.510 


0.306 


0.262 


0.500 


0.196 


0.007 


325 


0.509 


0.299 


0.260 


0.500 


0.182 


0.007 



Classification error rate estimation using leave-one-out cross-validation for classical sPLS-DA and multilevel sPLS-DA, with respect to the number of genes selected on 
each component (averaged over 100 simulation runs). 



to discriminate the four stimulations (Additional file 1: 
Figure S2). 

The univariate mixed model approach is commonly 
used to analyse data with repeated measurement with an 
unbalanced design. However, several reasons favor the use 
of a multilevel approach in this high dimensional setting. 
Apart from the already mentioned problem of numerous 
independent tests and the requirement to apply multiple 
correction [27], another limitation is the sensitivity of the 
FDR threshold (and therefore the number of declared DE 
genes) to the total number of test performed. The latter 
depends on the preprocessing method used to filter the 
probes. Another issue encountered was problems of con- 
vergence with both the maximum likelihood (ML) and 
the restricted ML methods due to the small number of 
samples in this data set. The asymptotic likelihood ratio 
test used for fixed effects has been reported to be anti- 
conservative in [28]. The authors recommended to use the 
F-test which still poses the issue of the choice of the num- 
ber of degrees of freedom with a small number of samples 
[29,30]. 

Multilevel approach with one factor 

A multilevel sPLS-DA analysis was performed on the 
W14 transcriptomics data, with H = 3. Respectively 30, 
137 and 123 genes were selected with the approach on 
each dimension according to the tuning criterion 1 for 
the most parcimonious model. Although k fold cross- 
validation would have been preferable to use, loo was used 
in this study given the small number of subjects. The 
following loo' classification error rates (0.48, 0.26, 0.24) 
were obtained on the first three sPLS-DA dimensions 



compared to (0.48, 0.36, 0.38) when applying sPLS-DA on 
the original matrices (see Additional file 1: Figure S3). 

Given the expression of these 290 selected genes, 
Figures 3(b) and 3(c) highlight a good separation between 
the four stimulations. These sample representations 
obtained from sPLS-DA reveal that the first compo- 
nent discriminates the stimulation LIP05 versus the 
other stimulations, while the second component discrimi- 
nates the stimulation GAG+ versus the other stimulations 
and the third component discriminates the stimulation 
GAG- versus the others. Therefore, the first two compo- 
nents separated the stimulations according to the pep- 
tides included in the vaccine. As expected there was a 
clear separation between LIP05 and other stimulation 
conditions. Especially, a part of the differential effect of 
LIP05 compared to GAG+ could be due to the lipid 
tail or perhaps to the effect of the other peptides of 
LIP05. GAG- and NS were not distinguishable on the 
first two components of the sPLS-DA (Figure 3(b)). This 
last result is not surprising as no specific response is 
expected from peptides not included in the vaccine after 
vaccination. 

Figure 3(a) displays the unsupervised clustering of 
the 290 selected probes. A cluster of 11 genes MT1M, 
C20ORF127, MT2A, MT1A, MT1G, MT1F, LOC441019, 
MT1X, MT1H, MTE, MT1E, was removed to improve 
visualisation. These genes selected on dimension 1 were 
all overexpressed in LIP05 stimulation (see Additional 
file 1: Figure S5). 

Several clusters of genes which expression seemed 
related to each type of stimulation could be identified. 
Cluster 1 included a subset of genes downregulated in 
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Figure 3 Multilevel sPLS-DA analysis on the transcriptomics data with one factor (W14). (a) Unsupervised clustering analysis with Euclidian 
distance and Ward method of the 290 genes selected by sPLS-DA. Samples are represented in columns and genes in rows, (b) and (c) sPLS-DA 
sample representation for dimensions 1 -2 (b) or 1 -3 (c). 



GAG-, in cluster 2 the genes were overexpressed in GAG-, 
while cluster 3 included a subset of genes overexpressed in 
LIP05 and GAG+, and cluster 4 was composed of a subset 
of genes mainly overexpressed in GAG+. The advantage of 
sPLS-DA is its ability to select genes related to a specific 
stimulation group on each component. For instance, clus- 
ters 1 and 2 included 126 out of the 137 probes selected 
on the third dimension which separated GAG- from the 
other stimulation groups (Figure 3(c)). Cluster 3 included 
19 out of the 30 probes selected on the first component, 
and 12 probes from the second component in order to dis- 
criminate stimulations LIP05 and GAG+, while the fourth 
cluster included 72 out of the 123 probes selected on the 
second component which separated GAG+ vs. the other 
stimulations (Figure 3(b)). Interestingly, some of the genes 
in this cluster belong to the TNF family (TNFSF13B) 
or interferon family (ISG20L2) demonstrating a specific 
effect of the GAG peptides on gene expression related to 
the immune response. 

Note that the same analysis was also performed on WO 
but identified much fewer discriminative genes (30 genes 



in total), indicating that there was a change in expression 
level after vaccination (see Additional file 1: Figure S8). 

Multilevel approach with two factors 

A multilevel sPLS-DA analysis was performed on the 
within matrix X w * including the time factor W0 and W14 
in addition to the stimulation factor for the transcrip- 
tomics data. The complexity of this cross-over design 
implied more conditions (4 x 2 = 8) to be compared for 12 
unique subjects. Therefore, the tuning criterion 2 gave for 
each sPLSDA dimension a maximum correlation of (0.94, 
0.96, 0.95) for variable selection sizes of 30, 40, 150 genes 
on each dimension. A sudden drop in the correlation value 
in the fourth dimension (0.62) indicated that 3 sPLS-DA 
dimensions should be chosen for this analysis. 

The hierarchical clustering of the 220 selected genes 
indicated a very satisfying separation of both time and 
stimulation factors (Figure 4(a)). The first component dis- 
criminated the stimulations GAG+/LIP05 vs. GAG-/NS 
irrespective of the time, whereas the second component 
reflected the time effect W0 vs. W14 (Figure 4(b)). This 
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Figure 4 Multilevel sPLS-DA analysis on the transcriptomics data with two factors stimulation and time, (a) Unsupervised clustering analysis 
with Euclidian distance and Ward method of the 220 genes selected by sPLS-DA. sPLS-DA sample representations for dimensions 1-2 (b) or 1-3 (c). 



suggests that the stimulation groups are easier to sep- 
arate than the time points by the approach. On this 
second dimension, relevant genes related to the immune 
response were selected (CD8a, CD79a, CD19, SLAMF6). 
The third component (Figure 4(c)) separated GAG+ vs. 
LIP05 irrespective of the time and several of the genes 
selected on this third dimension were found to be metal- 
lothionein genes (MT1M, MT2A, MT1A, MT1G, MT1F, 
MT1X, MT1H, MTE, MT1E) that may be stimulated by 
the lipid tail of LIP05. From a biological point of view, 
the significance of genes from metallothionein family in 
the context of HIV is not clear although some results 
have been recently reported [31]. These authors showed 
an increased resistance to apoptosis of immune-activated 
monocyte linked to the increase in Metallothionein (MT) 
gene expression and intracellular zinc levels. 

Integrative analysis 

Multilevel sPLS enables the integration of data mea- 
sured using different assays. This approach differs from 



multilevel sPLS-DA as the aim is to select subsets of genes 
and cytokines which are highly correlated (positively or 
negatively) across the samples. While the paired structure 
of the data is still taken into account in the analysis via 
the decomposition of the within matrices and ZjJ,, the 
analysis is completely unsupervised: no prior knowledge 
about the samples groups is included. 

Multilevel approach 

Multilevel sPLS was applied on the within matrices of 
the gene and cytokine data sets after vaccination. Given 
the very small number of cytokines, all cytokines were 
selected in the model, and the tuning of the number 
of variables to select was only performed on the gene 
expression data set. Respectively, a selection of 50, 1 
and 60 genes was performed each of the sPLS dimen- 
sion, corresponding to a correlation of (0.86, 0.62 and 
0.84). A drop of the subsequent correlations for the 
other dimensions guided the choice of 3 components in 
the model. 
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Although unexpected and indicated by the tuned cor- 
relation value of 0.62, the selection of one single gene on 
the second dimension was not surprising given the sam- 
ple representation that was obtained (see Additional file 1: 
Figure Sll): while the first and third dimensions separated 
LIP05, GAG+ and GAG-/NS, the second dimension did 
not seem to highlight any interesting pattern in the data. 
The approach might reveal some unknown phenomenon 



in the data for this component that would need to be fur- 
ther investigated. 

Nonetheless, sPLS multilevel was able to identify very rel- 
evant information from both data sets. Graphical tools 
help to unravel the correlation structure between the 
two data set such as Clustered Image Maps (CIM). 
Figure 5 reveals clusters of selected genes associated 
with cytokines secretion. These genes were not known to 



Color key 




o 



genes 

Figure 5 Integrative analysis of gene expression and cytokine secretion for W14. Clustered Image Maps (CIM) obtained from multilevel sPLS. 
Selected genes are represented in columns and cytokines in rows. 
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participate in the cytokines pathways but can be seen as 
gene signatures to predict future cytokine response. For 
example Figure 5 highlights relevant clusters of cytokines, 
such as the proximity of the two T-Helper type 2 (Th2) 
cytokines IL5 and IL13. Also, IL17 and IL21 have often 
been associated in the type 17 response. The correlations 
between genes and cytokines were similar for the pairs 
(IL5, IL13), (IL21,ILlb) and (TNF,IL6) underlying poten- 
tial similar pathways related to the production of these 
cytokines. 

Conclusion 

In this paper, we have proposed a two-step analysis com- 
bining a multilevel approach and a multivariate approach 
to analyze repeated measures of gene expression. The 
multilevel approach first extracts the within-sample varia- 
tion while the multivariate approach applied on the within 
matrix takes into account the dependency between the 
variables. The multilevel approach was extended for one 
and two factors analyses. 

Two multilevel variants were proposed with either 
sPLS-DA or sPLS. The multilevel sPLS-DA approach 
selects genes separating the groups of subjects on a single 
data set. The simulation study comparing multilevel sPLS- 
DA and the sPLS-DA applied on the original data demon- 
strated the good performance of the model. The multilevel 
sPLS approach integrates two experiments made on differ- 
ent platforms but on the same subjects, and selects subsets 
of correlated variables from both sets. 

The application of both types of approaches on the 
HIV-1 vaccine trial showed their ability to highlight the 
stimulation groups and to select biologically relevant 
genes related to immune response. Hence, our combined 
multilevel approach may help in finding signatures of 
vaccine effect and allows for a better understanding of 
immunological mechanisms activated by the interven- 
tion. Future work will include a thorough analysis on the 
gene/probe annotations to fully understand the mecha- 
nistic link between gene differential expression, cytokine 
secretion according to the various stimulations. 

Endnote 

a http://www.math.univ-toulouse.fr/~biostat/mixOmics. 
Additional files 

Additional file 1 : Supplementaries results regarding the VAC1 8 
study experiments from two assays. 

Additional file 2: R code used for the analysis of the VAC1 8 study. 
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